Solving a Complex Prisoner's Dilemma with Self-Modifying Policies
Abstract
Self-modifying policies (SMPs) trained by the success-story algorithm (SSA) have been successfully applied to various difficult reinforcement learning tasks (Schmidhuber et al., 1997a, 1997b). Here we present new results on an application where two cooperating/competing animats have to solve a complex version of the prisoner's dilemma.
1 Overview
SMP/SSA. An animat's modifiable components that determine its behavior are called its policy. An algorithm that modifies the policy is called a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves, and so on. They are of interest in situations where the initial learning algorithm itself can be improved by experience. How can we force some (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner's lifetime, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call; this evaluates the long-term effects of SMP-modifications setting the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. This scheme has been used to solve several difficult reinforcement learning tasks unsolvable by more traditional RL methods (Schmidhuber et al., 1997a, 1997b).
Outline. Basic principles of SMPs and SSA will be briefly reviewed in Section 2. Section 3 will describe an animat with a complex stochastic SMP, and Section 4 will present experiments with two such animats that need to learn to cooperate in a relatively complex environment to accelerate their reinforcement intake.
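The backtracking step described in the overview can be illustrated with a minimal sketch. It assumes a common reading of the success-story criterion from the cited work: each surviving modification must be followed by a higher reward-per-time rate than the modification before it. The class name `SSALearner`, the dictionary policy, and all method names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
_MISSING = object()  # marks a policy entry that did not exist before a modification


class SSALearner:
    """Toy learner with a checkpoint stack (hypothetical structure)."""

    def __init__(self, policy):
        self.policy = dict(policy)
        self.time = 0          # lifetime clock, never reset
        self.reward = 0.0      # cumulative reward since time 0
        self.stack = []        # entries: (time, cumulative reward, saved old values)

    def step(self, reward, dt=1):
        """Advance the lifetime clock and collect reward."""
        self.time += dt
        self.reward += reward

    def modify(self, key, new_value):
        """Change the policy, saving what is needed to undo the change."""
        old = self.policy.get(key, _MISSING)
        if self.stack and self.stack[-1][0] == self.time:
            # Modifications made at the same time share one checkpoint.
            self.stack[-1][2].setdefault(key, old)
        else:
            self.stack.append((self.time, self.reward, {key: old}))
        self.policy[key] = new_value

    def _rate_since(self, t, r):
        """Reward per time step collected since a checkpoint at (t, r)."""
        return (self.reward - r) / (self.time - t)

    def ssa_call(self):
        """Undo modifications not followed by a reward acceleration."""
        while self.stack:
            t_top, r_top, saved = self.stack[-1]
            t_prev, r_prev = self.stack[-2][:2] if len(self.stack) > 1 else (0, 0.0)
            if self.time > t_top and (
                self._rate_since(t_top, r_top) > self._rate_since(t_prev, r_prev)
            ):
                break  # top checkpoint is justified; assumed to hold for the rest
            for key, old in saved.items():
                if old is _MISSING:
                    self.policy.pop(key, None)
                else:
                    self.policy[key] = old
            self.stack.pop()
```

In this sketch a modification that has had no time to prove itself is undone as well, which is a simplifying assumption. Note that undoing one checkpoint can expose an earlier one to re-evaluation, so a long unproductive stretch may roll back several modifications in a single call.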
In what follows we review basic concepts presented in Schmidhuber et al. (1997a). Each animat lives in an unknown environment E from time 0 to unknown time T. Life is one-way: even if it is decomposable into numerous consecutive "learning trials", time will never be reset. The animat has an internal state S and a self-modifying policy SMP. Both S and SMP are variable dynamic data structures influencing probabilities of actions to be executed by the animat. Between time 0 and T, the animat repeats the following cycle over and over again (A denotes a set of possible actions): REPEAT: select and execute a ∈ A with conditional probability P(a | SMP, S). …
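The select-and-execute cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions: SMP is modelled as a plain lookup table from internal state to action probabilities, and `execute` is a hypothetical callback standing in for the environment E that returns the next state and the time consumed. The excerpt specifies neither, so both are placeholders.

```python
import random


def select_action(smp, state, rng):
    """Sample a in A with probability P(a | SMP, S).

    Assumption: smp[state] is a dict mapping each action to its probability.
    """
    probs = smp[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def live(smp, state, execute, lifetime, rng):
    """Repeat the cycle from time 0 until `lifetime`; time is never reset.

    `execute(action, state, smp)` is a hypothetical environment callback that
    may change S and SMP and returns (next_state, time_consumed), where
    time_consumed must be positive so that the loop terminates.
    """
    t = 0
    history = []
    while t < lifetime:
        action = select_action(smp, state, rng)
        state, dt = execute(action, state, smp)
        t += dt
        history.append(action)
    return history
```

Passing `smp` into `execute` reflects the point that actions may alter the policy itself; a table-based SMP is only one of many possible representations.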
Similar resources
Evolutionary Dynamics in Game-Theoretic Models
A number of evolutionary models based on the iterated Prisoner's Dilemma with noise are discussed. Different aspects of the evolutionary behaviour are illustrated (i) by varying the trickiness of the game (iterated game, mistakes, misunderstandings, choice of payoff matrix), (ii) by introducing spatial dimensions, and (iii) by modifying the strategy space and the representation of strategies. O...
State Sovereign Immunity and Stare Decisis: Solving the Prisoners' Dilemma within the Court
Electronic Theses and Dissertations UC San Diego
Abstract: Behavior analytic discussions of self-control have focused on temporal discounting as the primary index of self-control behavior. In this measure, choice between discrete, mutually exclusive, delayed outcomes is observed. The outcome of this self-control measure is well described by hyperbolic models of intertemporal choice. In the last ten years, a second measure of self-control has b...
Analyzing Social Network Structures in the Iterated Prisoner's Dilemma with Choice and Refusal
The Iterated Prisoner's Dilemma with Choice and Refusal (IPD/CR) [46] is an extension of the Iterated Prisoner's Dilemma with evolution that allows players to choose and to refuse their game partners. From individual behaviors, behavioral population structures emerge. In this report, we examine one particular IPD/CR environment and document the social network methods used to identify population...
Collective Action and the Evolution of Social Norms
With the publication of The Logic of Collective Action in 1965, Mancur Olson challenged a cherished foundation of modern democratic thought that groups would tend to form and take collective action whenever members jointly benefitted. Instead, Olson (1965, p. 2) offered the provocative assertion that no self-interested person would contribute to the production of a public good: "[U]nless th...